Abstract


In this paper we propose a family of tractable kernels that is dense
in the family of bounded positive semi-definite functions (i.e. can
approximate any bounded kernel with arbitrary precision). We start by
discussing the case of stationary kernels, and propose a family of
spectral kernels that extends existing approaches such as spectral
mixture kernels and sparse spectrum kernels. Our extension has two
primary advantages. Firstly, unlike existing spectral approaches that
yield infinite differentiability, the kernels we introduce allow
learning the degree of differentiability of the latent function in
Gaussian process (GP) models and functions in the reproducing kernel
Hilbert space (RKHS) in other kernel methods. Secondly, we show that
some of the kernels we propose require considerably fewer parameters
than existing spectral kernels for the same accuracy, thereby leading
to faster and more robust inference. Finally, we generalize our
approach and propose a flexible and tractable family of spectral
kernels that we prove can approximate any continuous bounded
nonstationary kernel.


1 Introduction 

Over the past two decades, the use of kernels has been at the heart of
many endeavours in the statistics and machine learning communities.
Kernels are often used as a flexible way of departing from linear
hypotheses in learning machines, thereby allowing for more complex
nonlinear patterns (Vapnik (1995, 1998)). They have indeed been
successfully applied to problems of classification, clustering,
density estimation and regression. The duality between kernels and
covariance functions has made kernels a critical tool for both
frequentist and Bayesian statisticians.¹ In the Bayesian nonparametric
community, kernels are often used as a covariance function of a
Gaussian process (GP), introduced as a prior over a latent function
that is to be inferred from the data. The family of covariance
functions postulated for the GP is typically chosen so as to express
prior domain knowledge about the underlying function, such as
periodicity, regularity and range. The parameters of the kernel are
then learned from the data. When one is concerned with automatically
uncovering structures from datasets, a flexible family of kernels
should be used that can account for intricate patterns. In that
regard, it is worth noting that, as most (if not all) loss functions
in kernel methods are continuous in the Gram/covariance matrix², if a
family of kernels $(k_\theta^K)_{K \in \mathbb{N}^*}$ can approximate
arbitrarily well any continuous bounded kernel, then for any
continuous bounded kernel $k$, there exists a kernel
$k_{\theta^*}^{K^*}$ in the foregoing family that is at least as good
as $k$ for the problem at hand (i.e. one that achieves a loss at least
as small as that of $k$).

Related work 

Approaches have been proposed in recent years that introduce greater
flexibility by combining standard unidimensional kernels through
series of compounded operations preserving the positive semi-definite
property. Examples of such approaches include the hierarchical kernel
learning model of Bach (2008), the additive kernels of Duvenaud et al.
(2011) and the compositional search method of Duvenaud et al. (2013).
Although these methods may be major improvements on popular isotropic
kernels for some applications, they are limited in that they may not
approximate every stationary kernel with arbitrary precision.

The aforementioned limitation has been addressed by spectral
approaches such as the sparse spectrum kernels of Lazaro-Gredilla et
al. (2010) and the spectral mixture kernels of Wilson and Adams
(2013). Their theoretical underpinning, namely Bochner’s theorem
(Stein (1999); Rasmussen and Williams (2005); Rudin (1962)), is
particularly helpful to construct flexible classes of kernels in that
it fully characterises all stationary kernels with a relatively simple
spectral representation condition.

¹We will use the expressions ’kernel’ and ’covariance function’
interchangeably to denote any symmetric positive semi-definite
function. Unless stated otherwise, kernels in this paper are
real-valued.

²E.g. the negative log-likelihood in GP methods, the Lagrangian in
kernel SVM, etc.

Theorem 1 (Bochner’s theorem) A complex-valued function $k$ on
$\mathbb{R}^d$ is the covariance function of a weakly stationary mean
square continuous complex-valued random process on $\mathbb{R}^d$ if
and only if it can be represented as

$$k(\tau) = \int_{\mathbb{R}^d} e^{2\pi i \omega^T \tau} \mu(d\omega), \tag{1}$$

where $\mu$ is a positive finite measure.

Bochner’s theorem introduces a duality between the flexibility of a
class of stationary kernels and the flexibility of the corresponding
family of spectral measures $\mu$. The link between sparse spectrum
kernels and spectral mixture kernels can be understood in the light of
Lebesgue’s decomposition theorem (Halmos (1950); Hewitt and Stromberg
(1975)). Lebesgue’s decomposition theorem implies that any positive
finite measure $\mu$ can be uniquely decomposed as

$$\mu = \mu_{cont.} + \mu_{sing.}, \tag{2}$$

where $\mu_{cont.}$ is a finite measure that is absolutely continuous
with respect to Lebesgue’s measure, and $\mu_{sing.}$ is a finite
measure that is mutually singular with Lebesgue’s measure³. Examples
of positive finite measures that are mutually singular with Lebesgue’s
measure are the discrete⁴ symmetric measures:

$$\mu_{sing.} = \sum_{k=1}^{+\infty} a_k \left(\delta_{\omega_k} + \delta_{-\omega_k}\right),$$

where $\omega_k \in \mathbb{R}^d$, $a_k > 0$,
$\sum_{k=1}^{+\infty} a_k < +\infty$, and

$$\forall A \subset \mathbb{R}^d, \quad \delta_x(A) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A. \end{cases}$$

It follows from Eq. (1) that these measures yield covariance functions
of the form:

$$k_{sing.}(\tau) = \sum_{k=1}^{+\infty} 2 a_k \cos(2\pi \omega_k^T \tau). \tag{3}$$

³That is, there exists a partition
$\mathbb{R}^d = S_{\mu_{cont.}} \cup S_{\mu_{sing.}}$,
$S_{\mu_{cont.}} \cap S_{\mu_{sing.}} = \emptyset$, such that every
subset of $S_{\mu_{sing.}}$ is of null Lebesgue’s measure and for
every subset $A \subset S_{\mu_{cont.}}$, $\mu_{sing.}(A) = 0$.

⁴Discrete or pure-point measures are measures supported on a countable
set.

When one is concerned with flexibly learning the shape of the
covariance function from the data, the Fourier coefficients $a_k$ need
to be inferred directly. For practical purposes we can only work with
a finite number $K$ of Fourier coefficients. This gives rise to a
simple extension of the sparse spectrum kernels introduced by
Lazaro-Gredilla et al. (2010). The authors capped the number of
spectral components and required that the Fourier coefficients be
identical:

$$k_{SS}(\tau) = \frac{\sigma^2}{K} \sum_{k=1}^{K} \cos(2\pi \omega_k^T \tau). \tag{4}$$

However, this family of kernels has three pitfalls. Firstly, they are
prone to over-fitting. As an illustration, when used for GP
regression, Lazaro-Gredilla et al. (2010) proved that such kernels are
equivalent to Bayesian basis function regression with trigonometric
basis functions. As such, the learning machine will aim at inferring
the $K$ major spectral frequencies evidenced in the training data.
This will only lead to appropriate prediction out-of-sample when the
underlying latent phenomenon can be appropriately characterized by a
finite discrete spectral decomposition that is expected to be the same
everywhere on the domain. Secondly, in GP regression, such kernels
implicitly postulate that the covariance between the values of the GP
at two points does not vanish as the distance between the points
becomes arbitrarily large. This imposes a priori the view that the
underlying function is highly structured, which might be unrealistic
in many real-life non-periodic applications. Thirdly, covariance
functions of the form of Eq. (3) yield infinite differentiability in
the mean square sense. As noted by Stein (1999), this is unrealistic
for modelling several physical processes.

Random Fourier features methods (Rahimi and Recht (2007); Le et al.
(2013); Yang et al. (2015)) are closely related to sparse spectrum
kernels. They are based on the observation that Eq. (1) may be
rewritten as

$$k(\tau) = \sigma^2 \int_{\mathbb{R}^d} e^{2\pi i \omega^T \tau} P(d\omega) := \sigma^2 \mathbb{E}_P\left(e^{2\pi i \omega^T \tau}\right)$$

with $P = \mu / \mu(\mathbb{R}^d)$ and $\sigma^2 = \mu(\mathbb{R}^d)$.
It then follows that, for any symmetric probability distribution $P$,
if the frequencies $\omega_k$ in Eq. (4) are sampled from $P$, then
the corresponding sparse spectrum kernel $k_{SS}(\tau)$ (Eq. (4)) is
an unbiased and consistent estimate of $k(\tau)$. Although random
Fourier features methods are scalable, they do not address the need
for flexibly learning the spectral measure $\mu$ from the data and are
not applicable to nonstationary kernels.
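As a minimal numerical sketch of the estimator described above (an illustration, not the authors' implementation), the snippet below samples the frequencies of a one-dimensional sparse spectrum kernel from a Gaussian $P$ and compares it with the kernel $\sigma^2 \exp(-2\pi^2 s^2 \tau^2)$ whose normalized spectral measure is that Gaussian. The function names, the spread $s = 0.5$, the seed and the evaluation grid are assumptions made for the example, and the sparse spectrum kernel is taken in the form $\frac{\sigma^2}{K}\sum_k \cos(2\pi\omega_k\tau)$.

```python
import math
import random


def sparse_spectrum_kernel(tau, freqs, sigma2=1.0):
    """Scalar-input sparse spectrum kernel: sigma2 / K * sum_k cos(2*pi*omega_k*tau)."""
    return sigma2 / len(freqs) * sum(math.cos(2.0 * math.pi * w * tau) for w in freqs)


def gaussian_spectrum_kernel(tau, spread, sigma2=1.0):
    """Kernel whose normalized spectral measure P is N(0, spread**2).

    With the 'real' frequency convention, E_P[cos(2*pi*omega*tau)]
    equals exp(-2 * pi**2 * spread**2 * tau**2).
    """
    return sigma2 * math.exp(-2.0 * math.pi ** 2 * spread ** 2 * tau ** 2)


def max_abs_error(num_freqs, spread=0.5, seed=0):
    """Largest gap between the Monte Carlo estimate and the target on a grid of lags."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    freqs = [rng.gauss(0.0, spread) for _ in range(num_freqs)]
    taus = [0.1 * i for i in range(21)]  # lags in [0, 2]
    return max(
        abs(sparse_spectrum_kernel(t, freqs) - gaussian_spectrum_kernel(t, spread))
        for t in taus
    )


if __name__ == "__main__":
    for num_freqs in (10, 100, 10000):
        print(num_freqs, max_abs_error(num_freqs))
```

The gap is expected to shrink roughly like $1/\sqrt{K}$, which illustrates consistency; note that the frequencies here are sampled from a fixed $P$ rather than learned from data.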

The approach introduced by Wilson and Adams (2013) focuses on the
continuous part $\mu_{cont.}$ of Lebesgue’s decomposition Eq. (2). It
follows from Radon-Nikodym’s theorem (Halmos (1950)) that
$\mu_{cont.}$ admits a (positive) density $f$ with respect to
Lebesgue’s measure. Moreover, it is easy to see from Eq. (1) that
$\int_{\mathbb{R}^d} f(\omega) d\omega = k_{cont.}(0) < +\infty$.
Hence, $\omega \mapsto f(\omega) / \int_{\mathbb{R}^d} f(\omega') d\omega'$
is a probability density function, which Wilson and Adams (2013)
modelled as independent mixtures of Gaussians in each dimension of the
spectral domain. The resulting family of spectral mixture kernels
reads:

$$k_{SM}(\tau) = \sum_{k=1}^{K} \sigma_k^2 \exp\left(-2\pi^2 \|\tau \odot \gamma_k\|^2\right) \cos(2\pi \omega_k^T \tau), \tag{5}$$

where $\tau \in \mathbb{R}^d$, $\omega_k \in \mathbb{R}^d$,
$\gamma_k \in \mathbb{R}^d$, $\sigma_k > 0$, and $\tau \odot \gamma_k$
denotes the Hadamard (also known as entrywise) product between the
vectors $\tau$ and $\gamma_k$.
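The spectral mixture kernel can be evaluated directly from its definition. The sketch below assumes the form $\sum_k \sigma_k^2 \exp(-2\pi^2\|\tau \odot \gamma_k\|^2)\cos(2\pi\omega_k^T\tau)$; the function name and the plain-list representation of vectors are illustrative choices, not part of the original work.

```python
import math


def spectral_mixture_kernel(tau, sigmas, gammas, omegas):
    """Evaluate the spectral mixture kernel at a lag vector tau (d floats).

    sigmas: K positive scales; gammas, omegas: K vectors of length d.
    """
    total = 0.0
    for sigma, gamma, omega in zip(sigmas, gammas, omegas):
        # squared norm of the Hadamard product tau * gamma_k
        sq_norm = sum((t * g) ** 2 for t, g in zip(tau, gamma))
        # inner product omega_k^T tau
        phase = sum(w * t for w, t in zip(omega, tau))
        total += (
            sigma ** 2
            * math.exp(-2.0 * math.pi ** 2 * sq_norm)
            * math.cos(2.0 * math.pi * phase)
        )
    return total
```

Each component is a Gaussian envelope (inverse input scales $\gamma_k$) multiplied by a cosine (frequency $\omega_k$), so the kernel is even in $\tau$ and attains its maximum $\sum_k \sigma_k^2$ at $\tau = 0$.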

Although mixtures of Gaussian distributions can be used to approximate
any distribution, spectral mixture kernels are limited in that, when
used as covariance functions, they yield infinite differentiability in
the mean square sense. Such an excessive smoothness assumption might
result in poor predictive accuracy. Moreover, a large number of
spectral mixture components might be required to account for lower
degrees of smoothness evidenced in the data. This would result in
inference techniques that are costlier, and less robust to local
optima.

When kernels are used as covariance functions, complex patterns in
datasets may also be regarded as evidence of nonstationarity, under
the (ergodic) assumption that some properties of a single path, for
instance the degree of homogeneity, are the same as the corresponding
properties considered across random samples of the underlying process.
This approach is common in time series analysis. In Bayesian
nonparametrics, nonstationarity may also be introduced to express
domain knowledge that varies throughout the input space. However,
commonly used approaches such as the input-dependent rescaling of
stationary covariance functions of Paciorek and Schervish (2004) and
spatial deformation of stationary covariance functions (Sampson and
Guttorp (1992); Damian et al. (2001); Schmidt and O’Hagan (2003)) are
application-specific in that they may not approximate arbitrarily well
every covariance function.

The primary contribution of this paper is to propose families of
spectral kernels we refer to as generalized spectral kernels, that (i)
we prove can approximate any (possibly nonstationary) bounded kernel,
and (ii) allow inference of the degree of differentiability of the
corresponding stochastic process when used as a covariance function,
or functions in the RKHS in alternative kernel methods such as support
vector machines. We show that the only (to the best of our knowledge)
existing families of kernels that can approximate arbitrarily well any
stationary kernel, namely the spectral mixture kernels of Wilson and
Adams (2013) and the sparse spectrum kernels of Lazaro-Gredilla et al.
(2010), are special cases of the families we propose.

The rest of the paper is structured as follows. In section 2 we
introduce generalized spectral kernels, and we prove that they can
approximate any continuous bounded kernel. We start by providing the
intuition and mathematical background underpinning our approach in
section 2.1. In section 2.2 we introduce stationary generalized
spectral kernels, we prove that they can approximate arbitrarily well
any stationary kernel, we show that they extend existing approaches,
and we provide examples of stationary generalized spectral kernels
that allow the learning of the degree of differentiability of latent
functions. In section 2.3 we extend our approach to nonstationary
kernels, and we prove that the family of generalized spectral kernels
we introduce can approximate arbitrarily well any continuous bounded
kernel. We provide empirical evidence that validates our approach in
section 3, and we conclude with a discussion in section 4.

2 Generalized Spectral Kernels 

2.1 Intuition and Background 

The intuition behind our approach is best illustrated with stationary
kernels that admit a spectral density. From a practical perspective,
in GP models, these are kernels that postulate that the correlation
between two GP values vanishes as the distance between the points
increases. Considering that spectral measures are finite according to
Bochner’s theorem, the spectral density of such a kernel is
integrable, and hence admits a Fourier transform. In fact, it follows
from Bochner’s theorem that the spectral density of such a kernel
turns out to be its Fourier transform, and vice versa⁵.

We are interested in constructing families of integrable functions
that can ‘approximate’ arbitrarily well any such spectral density in
the spectral domain, in a sense that is intuitive and can easily be
shown to yield an approximation of the inverse Fourier transform (i.e.
the kernel in the original domain). This will then allow us to
conclude that the inverse Fourier transforms of the approximating
functions in the spectral domain can approximate any stationary kernel
with absolutely continuous spectral measure. We would also like the
family of approximating functions in the original domain to
approximate arbitrarily well any stationary kernel whose spectral
measure has a non-null singular part in Lebesgue’s decomposition (e.g.
sparse spectrum kernels). There are two main possible approaches for
giving a meaning to the notion of approximating integrable
positive-valued functions: one probabilistic and the other
deterministic.

⁵We use the ‘real’ frequency convention for the Fourier transform:
$\mathcal{F}(f)(\omega) := \int_{\mathbb{R}^d} f(x) e^{-2\pi i \omega^T x} dx$.

Firstly, noting that integrable positive-valued functions can be
normalized to become probability density functions, approximation can
be thought of in the sense of the convergence in distribution of
random variables. We recall that convergence in distribution is
equivalent to pointwise convergence of cumulative distribution
functions at every continuity point of the limit, which does not imply
convergence of the corresponding probability density functions. Hence,
approximating in this sense does not guarantee approximating spectral
densities, let alone approximating their inverse Fourier transforms.
Stronger notions of convergence of random variables may be used, but
the resulting links between approximating in the spectral domain and
approximating in the original domain are more involved. This approach
is therefore not suitable for our purpose.

The deterministic alternative has several options, two of which are of
interest to us. The first notion of approximation is that of the
pointwise convergence of functions, according to which a sequence of
functions $(f_n)_n$ converges to a function $f$ if and only if for
every $x \in \mathbb{R}^d$ the sequence $(f_n(x))_n$ converges to
$f(x)$. The second notion of approximation is the one of the strong
topology of convergence in the space $L^1(\mathbb{R}^d)$ of integrable
functions, considered with its canonical norm:

$$\forall f, g \in L^1(\mathbb{R}^d), \quad \|f - g\|_{L^1} = \int_{\mathbb{R}^d} |f(x) - g(x)| dx.$$

More precisely, we say of a sequence of integrable functions $(f_n)_n$
that it converges in the $L^1$ sense to an integrable function $f$ if
and only if $\|f - f_n\|_{L^1}$ converges to 0; in other words when
the volume between the surfaces $z = f(x)$ and $z = f_n(x)$ goes to
zero. We recall that a set $\mathcal{G}$ is dense in a set
$\mathcal{H}$ with respect to some sense of convergence (topology) if
any element $h \in \mathcal{H}$ is the limit of some sequence of
elements $g_n$ in $\mathcal{G}$. If $f$ and $f_n$ are integrable,
denoting $F$ and $F_n$ their Fourier transforms, it follows from
Jensen’s inequality that

$$|f(x) - f_n(x)| = \left|\int_{\mathbb{R}^d} \left(F(\omega) - F_n(\omega)\right) e^{2\pi i \omega^T x} d\omega\right| \leq \int_{\mathbb{R}^d} |F(\omega) - F_n(\omega)| d\omega = \|F - F_n\|_{L^1}.$$

Hence, approximating in the spectral domain in the $L^1$ sense implies
approximating in the original domain in the pointwise sense. More
importantly, if a family of functions is dense in the space of
integrable functions (in the spectral domain) with respect to the
convergence in $L^1$, then the corresponding family of inverse Fourier
transforms is also dense in the space of integrable kernels (in the
original domain) with respect to the pointwise convergence of
functions.
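The bound $\sup_x |f(x) - f_n(x)| \leq \|F - F_n\|_{L^1}$ can be checked numerically on a simple pair of functions. The sketch below is an illustrative check under stated assumptions: it uses two zero-mean Gaussian spectral densities in one dimension (spreads 0.5 and 0.6, chosen arbitrarily), whose inverse Fourier transforms are known in closed form under the ‘real’ frequency convention, and approximates the $L^1$ distance with a Riemann sum. All function names are hypothetical.

```python
import math


def gaussian_density(omega, spread):
    """Spectral density F: the N(0, spread**2) probability density."""
    return math.exp(-0.5 * (omega / spread) ** 2) / (spread * math.sqrt(2.0 * math.pi))


def gaussian_kernel(x, spread):
    """Inverse Fourier transform of gaussian_density ('real' frequency convention)."""
    return math.exp(-2.0 * math.pi ** 2 * spread ** 2 * x ** 2)


def l1_distance(spread_a, spread_b, half_width=10.0, step=1e-3):
    """Riemann-sum approximation of ||F_a - F_b||_{L^1} on [-half_width, half_width]."""
    n = int(round(2.0 * half_width / step))
    return step * sum(
        abs(gaussian_density(-half_width + i * step, spread_a)
            - gaussian_density(-half_width + i * step, spread_b))
        for i in range(n + 1)
    )


def sup_distance(spread_a, spread_b, half_width=3.0, step=1e-3):
    """Largest pointwise gap between the two kernels on a grid of inputs."""
    n = int(round(2.0 * half_width / step))
    return max(
        abs(gaussian_kernel(-half_width + i * step, spread_a)
            - gaussian_kernel(-half_width + i * step, spread_b))
        for i in range(n + 1)
    )


if __name__ == "__main__":
    print("sup |f - f_n|  :", sup_distance(0.5, 0.6))
    print("||F - F_n||_L1 :", l1_distance(0.5, 0.6))
```

Closeness of the spectral densities in $L^1$ therefore controls the worst-case error of the kernels, which is the property the density argument relies on.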

2.2 Stationary Kernels 

Conditions for a family of functions to be dense in
$L^1(\mathbb{R}^d)$ have been extensively studied in the mathematical
analysis literature. The most famous results on the matter are known
as Wiener’s Tauberian theorems (Wiener (1932); Rudin (1962); Korevaar
(2004)). We recall the theorem that is of interest to us below.

Theorem 2 (Wiener’s Tauberian theorem) If $f$ is a function in
$L^1(\mathbb{R}^d)$, a necessary and sufficient condition for the set
of all linear combinations of translations of $f$ to be dense in
$L^1(\mathbb{R}^d)$ (in the sense of the convergence in $L^1$) is that
the Fourier transform of $f$

$$F(\omega) := \mathcal{F}(f)(\omega) = \int_{\mathbb{R}^d} f(x) e^{-2\pi i \omega^T x} dx$$

has no zeros.

Gaussian probability density functions in the spectral domain satisfy
the conditions of Wiener’s Tauberian theorem, and the corresponding
linear combinations of translations give rise to the spectral mixture
kernels of Wilson and Adams (2013). Wiener’s Tauberian theorem however
provides a considerably weaker condition. We use it in the spectral
domain to construct a broad range of families of tractable functions
in the original domain, that are dense in the family of stationary
real-valued kernels with respect to the pointwise convergence of
functions.

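Theorem 3 Let $h$ be a real-valued positive semi-definite, continuous,
and integrable function such that
$\forall \tau \in \mathbb{R}^d,\ h(\tau) > 0$. The family of functions

$$k_K(\tau) = \sum_{k=1}^{K} \alpha_k h(\tau \odot \gamma_k) \cos(2\pi \omega_k^T \tau), \tag{6}$$

with $\omega_k, \gamma_k \in \mathbb{R}^d$, $\alpha_k \in \mathbb{R}$,
$K \in \mathbb{N}^*$, is dense in the family of stationary real-valued
kernels with respect to the pointwise convergence.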

Proof Sketch: The functions $k_K$ arise as inverse Fourier transforms
of linear combinations of translations of the Fourier transform of
$h$: $\mathcal{F}(h)$. As $h = \mathcal{F}(\mathcal{F}(h))$, the
requirement $\forall \tau \in \mathbb{R}^d,\ h(\tau) > 0$ makes
Wiener’s Tauberian theorem applicable. See Appendix A for the full
proof. ■



The assumptions of Th. 3 are mostly standard. From a practical
perspective, the requirement $h(\tau) > 0$ is what makes the family of
functions approximate arbitrarily well the absolutely continuous part
of any spectral measure, whereas the continuity assumption implies
$\lim_{\gamma \to 0} h(\tau \odot \gamma) = h(0) < +\infty$, which
allows approximating arbitrarily well the singular part of any
spectral measure. The parameters $\gamma_k$ serve as inverse input
scales. Noting that $h(\tau \odot \gamma_k) \cos(2\pi \omega_k^T \tau)$
is positive semi-definite, to restrict ourselves to valid kernels we
only consider linear combinations with non-negative coefficients. We
may also further impose $h(0) = 1$ without loss of generality.

Definition 4 Following the notations of Th. 3, we call stationary
generalized spectral kernels the functions of the form:

$$k(\tau) = \sum_{k=1}^{K} \sigma_k^2 h(\tau \odot \gamma_k) \cos(2\pi \omega_k^T \tau), \tag{7}$$

where $h(0) = 1$ and $\sigma_k > 0$.

Differentiability: When stationary generalized spectral kernels are
used as covariance functions, the following proposition establishes
the degree of smoothness they induce.

Proposition 5 A mean zero stationary Gaussian process with stationary
generalized spectral covariance function is $p$ times continuously
differentiable in the mean square sense if and only if a mean zero
stationary Gaussian process with covariance function $h$ is.

Proof See Appendix B. ■

Examples: Sparse spectrum kernels correspond to the limit case
$\gamma_k \to 0$ with equal $\sigma_k$ terms. Moreover, it follows
from Eq. (5) that the spectral mixture kernels of Wilson and Adams
(2013) correspond to the special case
$h(\tau) = \exp(-2\pi^2 \|\tau\|^2)$, which satisfies the conditions
of Th. 3, and yields infinitely differentiable GPs as a result of
Prop. 5. It is easy to verify that the Matern kernels

$$k_{MA}(\tau; \nu) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\sqrt{2\nu}\,\|\tau\|\right)^{\nu} K_\nu\left(\sqrt{2\nu}\,\|\tau\|\right),$$

where $\Gamma$ is the gamma function and $K_\nu$ is the modified
Bessel function of second kind, satisfy the conditions of Th. 3.
Hence, Matern spectral kernels

$$k_{sGS\text{-}MA}(\tau) = \sum_{k=1}^{K} \sigma_k^2 k_{MA}(\tau \odot \gamma_k; \nu) \cos(2\pi \omega_k^T \tau),$$

with $\omega_k, \gamma_k \in \mathbb{R}^d$, $K \in \mathbb{N}^*$, are
also dense in the family of stationary kernels, and allow learning the
differentiability of the underlying latent function from the data.
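To make the role of the modulating function concrete, here is a small sketch of a Matern spectral kernel restricted to $\nu \in \{1/2, 3/2, 5/2\}$, for which the unit-scale Matern kernel has well-known closed forms that avoid the Bessel function. The function names and the restriction to three values of $\nu$ are assumptions of this example; learning $\nu$ from data, as discussed above, would require the general Bessel-function form.

```python
import math


def matern(r, nu):
    """Unit-scale Matern kernel at distance r >= 0 for nu in {0.5, 1.5, 2.5}."""
    if nu == 0.5:
        return math.exp(-r)
    if nu == 1.5:
        s = math.sqrt(3.0) * r
        return (1.0 + s) * math.exp(-s)
    if nu == 2.5:
        s = math.sqrt(5.0) * r
        return (1.0 + s + s * s / 3.0) * math.exp(-s)
    raise ValueError("only nu in {0.5, 1.5, 2.5} is implemented in this sketch")


def matern_spectral_kernel(tau, sigmas, gammas, omegas, nu):
    """sum_k sigma_k^2 * matern(||tau * gamma_k||, nu) * cos(2*pi*omega_k^T tau)."""
    total = 0.0
    for sigma, gamma, omega in zip(sigmas, gammas, omegas):
        # norm of the Hadamard product tau * gamma_k (gamma_k: inverse input scales)
        radius = math.sqrt(sum((t * g) ** 2 for t, g in zip(tau, gamma)))
        phase = sum(w * t for w, t in zip(omega, tau))
        total += sigma ** 2 * matern(radius, nu) * math.cos(2.0 * math.pi * phase)
    return total
```

The behaviour near $\tau = 0$ is what encodes smoothness: with $\nu = 1/2$ the kernel drops linearly away from the origin (continuous but non-differentiable paths), whereas with $\nu = 5/2$ the drop is quadratic.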


2.3 Nonstationary Kernels 

Bochner’s theorem was the cornerstone of the previous section. The
spectral characterisation of stationary kernels it provides turned the
problem of approximating stationary kernels into that of approximating
measures in the spectral domain. Luckily, it turns out that a more
general spectral characterisation exists that includes nonstationary
kernels (see (Yaglom, 1987, §26.4) and (Loeve, 1994, §37.4) for the
univariate case and (Genton, 2002, pp. 308) and (Kakihara, 1985, pp.
149) for a generalization).

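Theorem 6 A complex-valued bounded continuous function $k$ on
$\mathbb{R}^d \times \mathbb{R}^d$ is the covariance function of a
mean square continuous complex-valued random process on $\mathbb{R}^d$
if and only if it can be represented as

$$k(x, y) = \int_{\mathbb{R}^d \times \mathbb{R}^d} e^{2\pi i (x^T \omega_1 - y^T \omega_2)} \mu_F(d\omega_1, d\omega_2), \tag{8}$$

where $\mu_F$ is the Lebesgue-Stieltjes measure associated to some
positive semi-definite function $F(\omega_1, \omega_2)$ with bounded
variations.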

When the spectral measure $\mu_F$ has mass concentrated along the
diagonal $\omega_1 = \omega_2$, we recover Bochner’s theorem. We may
once again leverage Wiener’s Tauberian theorem to construct families
of functions in the spectral domain that are dense in
$L^1(\mathbb{R}^d \times \mathbb{R}^d)$ with respect to the
convergence in $L^1$, so that any spectral density can be approximated
arbitrarily well. The argument developed in section 2.1 may once again
be used, in conjunction with Th. 6 rather than Bochner’s theorem, to
demonstrate that this would correspond to approximating arbitrarily
well any bounded kernel in the original domain in the sense of the
pointwise convergence of functions. We obtain the following result.


Theorem 7 Let $(x, y) \mapsto k^*(x, y)$ be a real-valued positive
semi-definite, continuous, and integrable function such that
$\forall x, y,\ k^*(x, y) > 0$. The family

$$k_K(x, y) = \sum_{k=1}^{K} \alpha_k k^*(x \odot \gamma_k, y \odot \gamma_k) \Psi_k(x)^T \Psi_k(y),$$

where

$$\Psi_k(x) = \begin{pmatrix} \cos(2\pi x^T \omega_k^1) + \cos(2\pi x^T \omega_k^2) \\ \sin(2\pi x^T \omega_k^1) + \sin(2\pi x^T \omega_k^2) \end{pmatrix},$$

with $\gamma_k, \omega_k^1, \omega_k^2 \in \mathbb{R}^d$,
$\alpha_k \in \mathbb{R}$, $K \in \mathbb{N}^*$, is dense in the
family of real-valued continuous bounded nonstationary kernels with
respect to the pointwise convergence of functions.


Proof See Appendix C. ■

The functions
$k^*(x \odot \gamma_k, y \odot \gamma_k) \Psi_k(x)^T \Psi_k(y)$ are
positive semi-definite, as products of positive semi-definite
functions, so that to build a flexible family of expressive
nonstationary kernels we may simply require $\alpha_k > 0$. We may
also impose $k^*(0, 0) = 1$ without loss of generality.

Definition 8 Following the notations of Th. 7, we call generalized
spectral kernels the functions of the form:

$$k(x, y) = \sum_{k=1}^{K} \sigma_k^2 k^*(x \odot \gamma_k, y \odot \gamma_k) \Psi_k(x)^T \Psi_k(y) \tag{9}$$

with $\sigma_k > 0$ and $k^*(0, 0) = 1$, and relaxing the
integrability condition on $k^*$.

Remarks: We note that when $k^*$ is stationary and
$\omega_k^1 = \omega_k^2 = \omega_k$ we recover stationary generalized
spectral kernels. The differentiability discussion of the stationary
case can be extended. In the general setting, it is sufficient that a
generalized spectral covariance function be $2p$ times differentiable
for the corresponding mean zero Gaussian process to be mean square $p$
times differentiable (Adler and Taylor (2011)). Similarly to the
stationary case, the degree of smoothness may be learned or set a
priori through the function $k^*$.

Examples of integrable $k^*$: Silverman (1957) introduced numerous
examples of real-valued continuous and integrable covariance functions
that are so-called ‘locally stationary’; i.e. of the form

$$k^*(x, y) = k_1(x - y)\, k_2\left(\frac{x + y}{2}\right),$$

where $k_1$ is a stationary covariance function and $k_2$ is
positive-valued. The approaches suggested by the author are flexible
enough that $k^*$ can be constructed so that
$\forall x, y,\ k^*(x, y) > 0$ and the degree of differentiability may
be controlled for instance by taking $k_1$ to be a Matern kernel.
Moreover, with this choice of $k^*$, nonstationary random Fourier
features approximations may easily be constructed by noting that Eq.
(8) may be rewritten in this case as

$$k(x, y) = \sigma^2 \mathbb{E}_Q\left(\Psi_{\omega_1, \omega_2}(x)^T \Psi_{\omega_1, \omega_2}(y)\right),$$

with

$$\Psi_{\omega_1, \omega_2}(x) = \begin{pmatrix} \cos(2\pi x^T \omega_1) + \cos(2\pi x^T \omega_2) \\ \sin(2\pi x^T \omega_1) + \sin(2\pi x^T \omega_2) \end{pmatrix},$$

$\sigma^2$ a positive normalizing constant, and where $Q$ is a
location-scale mixture of the distribution whose density is the
Fourier transform of $k^*$, and with mixing probabilities deduced from
Eq. (9).

$k^*$ may also be chosen to be of the form

$$k^*(x, y) = k_1(x) k_1(y), \tag{10}$$

where $k_1$ is any continuous, integrable and positive-valued
function. In this case, $\{k_1(x) \Psi_k(x)\}_{k=1}^{K}$ form a basis
of the RKHS induced by the kernel $k$. As the dimension of the RKHS is
finite, exact inference may be achieved in $O(nK^2)$ time complexity
and with $O(nK^2)$ memory requirement in most kernel methods, where
$n$ is the number of samples. Differentiability may once again be
controlled by taking $k_1$ to be a Matern kernel.
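The finite-dimensional structure obtained with the separable choice $k^*(x, y) = k_1(x) k_1(y)$ can be illustrated with a short sketch for scalar inputs. Everything specific here is an assumption made for illustration: the choice $k_1(u) = \exp(-|u|)$, the parameter tuples `(sigma, gamma, omega1, omega2)`, and the function names. The sketch evaluates the kernel both from the sum over components and from an explicit $2K$-dimensional feature vector.

```python
import math


def k1(x, gamma):
    """Illustrative positive-valued, integrable k_1, evaluated at x * gamma."""
    return math.exp(-abs(x * gamma))


def psi(x, omega1, omega2):
    """Two-dimensional trigonometric feature Psi_k(x) for scalar x."""
    a, b = 2.0 * math.pi * x * omega1, 2.0 * math.pi * x * omega2
    return (math.cos(a) + math.cos(b), math.sin(a) + math.sin(b))


def kernel(x, y, params):
    """Generalized spectral kernel with k*(x, y) = k1(x) * k1(y).

    params: list of (sigma, gamma, omega1, omega2) tuples, one per component.
    """
    total = 0.0
    for sigma, gamma, omega1, omega2 in params:
        px, py = psi(x, omega1, omega2), psi(y, omega1, omega2)
        total += (
            sigma ** 2 * k1(x, gamma) * k1(y, gamma)
            * (px[0] * py[0] + px[1] * py[1])
        )
    return total


def features(x, params):
    """2K-dimensional vector phi(x) with kernel(x, y) = phi(x) . phi(y)."""
    phi = []
    for sigma, gamma, omega1, omega2 in params:
        scale = sigma * k1(x, gamma)
        phi.extend(scale * component for component in psi(x, omega1, omega2))
    return phi
```

Because the kernel is an inner product of $2K$ features, a Gram matrix on $n$ points factorizes as $\Phi \Phi^T$ with $\Phi$ of size $n \times 2K$, which is what makes inference scale with $K$ rather than cubically in $n$. Note also that $k(x, x)$ varies with $x$, which a stationary kernel cannot reproduce.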


3 Experiments 

In this section SE denotes the squared exponential kernel (i.e.
$k(\tau) = \sigma^2 \exp(-\|\tau\|^2 / (2\lambda^2))$), SS denotes the
sparse spectrum kernel, MA*2 denotes the Matern */2 kernel, S-*
denotes the stationary generalized spectral kernel with modulating
kernel (i.e. $h$ in Eq. (7)) the kernel *, and NS-* denotes the
nonstationary generalized spectral kernel of the form Eq. (10) where
$k_1$ is the kernel *. Moreover, unless stated otherwise all spectral
kernels have $K = 5$ spectral components.

Option pricing: Firstly, we consider modelling the evolution of the
price of a put option on the STOXX Europe 600 Banks index⁶ as a
function of time (i.e. the theta of the option), through GP
regression⁷. We use a third of the data for training and the rest for
prediction. As evidenced by Tab. 1, the spectral Matern 3/2 kernel
(S-MA32) improves on the predictive accuracy (RMSE) of the spectral
mixture kernel (S-SE). Given the density property of S-SE, this
suggests that on this dataset, the spectral mixture kernel requires
more parameters than the spectral MA 3/2 kernel to achieve the same
accuracy, thus making the latter kernel faster and more robust to
local maxima during inference. Fig. 1 illustrates the learned
posterior mean ± 2 posterior standard deviations for the S-MA32
kernel. Learned kernels are illustrated in Fig. 2.


Table 1: Fit and predictive accuracy on the option experiment.

            SE       S-SE     S-MA52   S-MA32
Log. Lik.   -28.56   -18.61   -19.39   -19.60
RMSE        0.89     0.90     0.76     0.64


Air temperature anomalies: Our second experiment is based on the well
studied temperature anomalies dataset of Wood et al. (2002). The
dataset consists of monthly readings of air temperature anomalies at
various points on the globe in December 1993. The authors defined air
temperature anomaly as the deviation of a monthly temperature at a
given location from the average over the period 1950-1979 of the
monthly temperatures at the same location. There were 445 readings in
December 1993. We selected 2/3 of the data at random for training and
predicted the left-out temperature anomalies using GP regression.
Training and predictive results are summarized in Tab. 2. We found
that the spectral Matern 1/2 kernel outperforms competing kernels
including the spectral mixture kernel (S-SE), which evidences that the
latent anomaly function is best modelled as continuous but not
smoother. Fig. 3 illustrates a map of the posterior mean of the
temperature anomaly, and a map of the learned correlation between the
temperature anomaly in London and elsewhere on the globe under the
spectral Matern 1/2 kernel.

⁶We use the strike 195, maturity June 2015 put option on the STOXX 600
Banks index. The data originate from Bloomberg (security code SX7P 6
P195).

⁷Kernel hyper-parameters are learned by maximizing the marginal
log-likelihood.

Figure 1: Posterior mean ±2σ in the option experiment with S-MA32
kernel.

Figure 2: Learned kernels in the option experiment (horizontal axis:
τ in years).

Table 2: Training log-likelihood and predictive accuracy on the air
temperature anomalies dataset.

            SE        S-SE      S-MA12    S-MA32
Log. Lik.   -358.58   -341.68   -316.89   -326.10
RMSE        1.34      1.31      1.28      1.29

Approximating nonstationary kernels: Finally, we consider
approximating a nonstationary kernel in order to demonstrate the need
for nonstationary generalized spectral kernels. The nonstationary
kernel of interest is the covariance function of a time-inverted
fractional Brownian motion (IFBM):

$$k_{IFBM}(t, s) = \frac{1}{2}\left(\frac{1}{t^{2h}} + \frac{1}{s^{2h}} - \left|\frac{1}{t} - \frac{1}{s}\right|^{2h}\right)$$

with $t, s > 0$, $0 < h < 1$. Such kernels might be particularly
useful to model continuous latent functions with known long range
behaviour and uncertain short range behaviour (e.g. the price of an
option contract as a function of time, value functions in dynamic
programming). Higher values of the Hurst index $h$ result in more
volatile increments and rougher paths of the corresponding IFBM. We
approximate $k_{IFBM}$ on $(0.01, 1]$ with generalized spectral
kernels for several values of $h$. The parameters of approximating
kernels are learned by minimizing the sum of the square errors between
$k_{IFBM}$ and the approximating spectral kernel, both evaluated on a
uniform grid with mesh size 0.02. Tab. 3 reports the corresponding
root mean square errors normalized by the average value of $k_{IFBM}$
on the grid for different values of the Hurst index, and with $K = 5$
spectral components. Fig. 4 illustrates sections of the learned
kernels along the vertical plane $s = 0.5$ for 3 different Hurst
indices. It can be seen that nonstationary spectral kernels
considerably outperform stationary alternatives such as the spectral
mixture kernel and the sparse spectrum kernel. This comes as no
surprise given that
$\forall u \neq 0,\ k_{IFBM}(0.5, 0.5 + u) \neq k_{IFBM}(0.5, 0.5 - u)$,
which cannot be modelled by stationary kernels. More importantly, it
can be seen at a glance in Fig. 4 that, with only 5 spectral
components, nonstationary generalized spectral kernels approximate the
IFBM kernel well in absolute terms, which is consistent with the
density property discussed in the previous section.
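The asymmetry argument and the error metric can be sketched in a few lines. The covariance used below is an assumption of this example: it takes the IFBM kernel to be the fractional Brownian motion covariance evaluated at the inverted times $1/t$ and $1/s$, i.e. $\frac{1}{2}(t^{-2h} + s^{-2h} - |1/t - 1/s|^{2h})$. The grid passed to `normalized_rmse` and the function names are likewise illustrative, and no fitting procedure is shown.

```python
import math


def ifbm_kernel(t, s, hurst):
    """Assumed IFBM covariance: fBM covariance at the inverted times 1/t and 1/s."""
    if t <= 0.0 or s <= 0.0:
        raise ValueError("t and s must be positive")
    u, v = 1.0 / t, 1.0 / s
    return 0.5 * (u ** (2 * hurst) + v ** (2 * hurst) - abs(u - v) ** (2 * hurst))


def normalized_rmse(approx, hurst, grid):
    """RMSE between approx(t, s) and the IFBM kernel over grid x grid,
    divided by the average IFBM value on that grid."""
    sq_err, total, count = 0.0, 0.0, 0
    for t in grid:
        for s in grid:
            target = ifbm_kernel(t, s, hurst)
            sq_err += (approx(t, s) - target) ** 2
            total += target
            count += 1
    return math.sqrt(sq_err / count) / (total / count)


if __name__ == "__main__":
    # Moving the second argument by +u or -u around 0.5 gives different values,
    # which no function of t - s alone can reproduce.
    print(ifbm_kernel(0.5, 0.7, 0.2), ifbm_kernel(0.5, 0.3, 0.2))
```

A candidate kernel, stationary or not, can be passed as `approx` to obtain a score comparable in spirit to the normalized errors discussed above, although the exact grid used in the experiment is not specified beyond its mesh size.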


4 Conclusion 

We propose families of kernels we refer to as gener¬ 
alized spectral kernels that we prove can approximate 
arbitrarily well any continuous bounded kernel. As 
a result, given that loss functions in kernel methods 
are often continuous in the kernel/Gram matrix, gen¬ 
eralized spectral kernels (out of the box) can perform 
























h = 0.4 


Posterior Mean Temperature Anomaly (5-MA12) 



■c 


Correlation with Temperature Anomaly in London (5-MA12) 



Figure 3: Posterior mean temperature anomaly and 
learned correlation function between the temperature 
anomaly in London and elsewhere on the globe under 
the spectral Matern 1/2 kernel. 

Table 3: Normalized RMSE of approximations of the 
time-inverted fractional Brownian motion kernel with 
Hurst index h by various spectral kernels. 


h 

S-SE 

ss 

NS-SE 

NS-MA12 

NS-MA12 

0.2 

0.22 

0.26 

0.09 

0.08 

0.08 

0.5 

0.48 

0.49 

0.09 

0.07 

0.08 

0.8 

0.37 

0.37 

0.10 

0.09 

0.08 


as well as (if not better than) any hand-crafted kernel
of practical interest, stationary or not. We show
that the only (to the best of our knowledge) families of
kernels that have previously been proposed (that can
approximate arbitrarily well any stationary kernel) are
special cases of generalized spectral kernels. Critically,
our extension improves on competing approaches in
that it allows learning the degree of smoothness of the
latent function in Gaussian process models, or that of
functions in the RKHS in other kernel methods. More




[Figure 4, panel: h = 0.6]

Figure 4: Approximations of the IFBM kernel by spectral
kernels with K = 5 components.

importantly, the families of kernels we propose are the
first families of kernels that can approximate arbitrarily
well any bounded continuous nonstationary kernel.
Finally, our nonstationary extension is amenable to
scalable inference either directly, or through a nonstationary
extension of random Fourier features approximations.


Appendix 


In what follows, we use the 'real' (as opposed to angular)
frequency convention for the Fourier transform.
That is, if $f$ is an integrable real-valued function on
$\mathbb{R}^d$ (i.e. $\int_{\mathbb{R}^d} |f(x)|\,dx < +\infty$), the Fourier transform of
$f$ reads

$$F(\omega) = \mathcal{F}(f)(\omega) = \int_{\mathbb{R}^d} f(x)\, e^{-2\pi i \omega^T x}\,dx,$$

and the inverse Fourier transform is obtained as

$$f(x) = \int_{\mathbb{R}^d} F(\omega)\, e^{2\pi i \omega^T x}\,d\omega.$$

It is a direct consequence of our convention that
$\mathcal{F}(F)(x) = f(-x)$.
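The convention above can be checked numerically. The following is a minimal sketch in one dimension, assuming a simple Riemann-sum quadrature on a uniform grid; the helper `fourier_transform` and the test functions are illustrative choices, not part of the paper. It checks that $e^{-\pi x^2}$ is its own transform under the real-frequency convention, and that applying the transform twice to a non-even function returns $f(-x)$.

```python
import numpy as np

def fourier_transform(values, grid, freqs):
    """Riemann-sum approximation of F(w) = int f(x) exp(-2*pi*i*w*x) dx
    for samples `values` of f on the uniform grid `grid` (hypothetical helper)."""
    dx = grid[1] - grid[0]
    phase = np.exp(-2j * np.pi * np.outer(freqs, grid))
    return (phase @ values) * dx

# Symmetric uniform grid, so that reversing an array maps x to -x.
grid = np.linspace(-8.0, 8.0, 1601)

# exp(-pi x^2) is its own Fourier transform under the real-frequency convention.
gauss = np.exp(-np.pi * grid**2)
gauss_hat = fourier_transform(gauss, grid, grid)
err_gauss = float(np.max(np.abs(gauss_hat - gauss)))

# Non-even test function: applying the transform twice should give f(-x).
f = np.exp(-np.pi * (grid - 1.0) ** 2)
F = fourier_transform(f, grid, grid)
FF = fourier_transform(F, grid, grid)
err_flip = float(np.max(np.abs(FF - f[::-1])))

print(err_gauss, err_flip)
```

Both errors should be at the level of quadrature and rounding error, since the integrands are smooth and decay rapidly within the grid.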


Appendix A Proof of Th. 3 


We want to prove the following theorem.

Theorem: Let $h$ be a real-valued positive semi-definite,
continuous, and integrable function such that
$\forall \tau \in \mathbb{R}^d,\ h(\tau) > 0$. The family of functions

$$\sum_{k=1}^{K} \alpha_k\, h(\tau \odot \gamma_k) \cos(2\pi \omega_k^T \tau), \qquad (11)$$

with $\omega_k, \gamma_k \in \mathbb{R}_+^d$, $\alpha_k \in \mathbb{R}$, $K \in \mathbb{N}^*$ is dense in the
family of stationary real-valued kernels with respect
to the pointwise convergence.

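Proof. Let us define

$$h_\gamma(\tau) := h(\tau \odot \gamma)$$

and

$$k_g(\tau; \gamma, \omega) := h_\gamma(\tau) \cos(2\pi \omega^T \tau).$$

Firstly we note that the function $k_g$ is integrable$^{8}$, and
therefore admits a Fourier transform. As $h$ is integrable
and even, it admits an even Fourier transform.
Denoting $H = \mathcal{F}(h)$ the Fourier transform of $h$, and $\oslash$
the element-wise division, by definition of the inverse
Fourier transform, and using properties of the Fourier
transform, it follows that, for $k_g(\cdot\,; \gamma, \omega_k)$,

$$\mathcal{F}(k_g)(\omega) = \frac{1}{2\prod_{j=1}^{d} \gamma[j]} \Big[ H\big((\omega - \omega_k) \oslash \gamma\big) + H\big((\omega + \omega_k) \oslash \gamma\big) \Big],$$

which, because $H$ is even, can be rewritten as

$$\mathcal{F}(k_g)(\omega) = \frac{1}{2\prod_{j=1}^{d} \gamma[j]} \Big[ H\big((\omega_k - \omega) \oslash \gamma\big) + H\big((\omega_k + \omega) \oslash \gamma\big) \Big].$$

$^{8}$ Because the function $h_\gamma$ is integrable and the cosine function is
bounded.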
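As a sanity check of the closed form for $\mathcal{F}(k_g)$, the following sketch compares it with a numerical Fourier transform in one dimension. It assumes $h(\tau) = e^{-\pi \tau^2}$, which is positive, continuous, integrable and positive semi-definite, and whose transform under the real-frequency convention is $H(\omega) = e^{-\pi \omega^2}$; the values of `gamma` and `omega_k` are arbitrary illustrative choices.

```python
import numpy as np

def h(t):
    # Assumed base function h(t) = exp(-pi t^2).
    return np.exp(-np.pi * t**2)

# Under the real-frequency convention this h is its own Fourier transform.
H = h

gamma, omega_k = 0.7, 1.5  # hypothetical scale and frequency parameters

tau = np.linspace(-10.0, 10.0, 2001)
dtau = tau[1] - tau[0]
k_g = h(gamma * tau) * np.cos(2 * np.pi * omega_k * tau)

omega = np.linspace(-4.0, 4.0, 161)

# Riemann-sum approximation of the Fourier transform of k_g.
numeric = (np.exp(-2j * np.pi * np.outer(omega, tau)) @ k_g) * dtau

# Closed form: (1 / (2 * gamma)) * [H((w - w_k) / gamma) + H((w + w_k) / gamma)].
closed = (H((omega - omega_k) / gamma) + H((omega + omega_k) / gamma)) / (2 * gamma)

max_err = float(np.max(np.abs(numeric - closed)))
print(max_err)
```

The two agree up to quadrature error in this example, which illustrates the $1/(2\prod_j \gamma[j])$ normalization and the two shifted copies of $H$.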


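Let us now consider a stationary real-valued kernel $k$.

Case 1: $\mu_{\text{sing.}} = 0$

When the singular part of the spectral measure of $k$
in Lebesgue's decomposition theorem is null, the spectral
measure of $k$ admits a density with respect to
Lebesgue's measure and we note

$$\frac{d\mu}{d\omega} = F.$$

The density $F$ is even as $k$ is real-valued. It is
also integrable$^{9}$. Let us consider the function
$\omega \to \frac{1}{\prod_{j=1}^{d} \gamma[j]} H(\omega \oslash \gamma)$. It is integrable and its Fourier
transform $\tau \to h_\gamma(-\tau)$ is strictly positive everywhere.
Hence, by Wiener's Tauberian theorem,

$$\exists\, (\alpha_k, \omega_k)_{k \in \mathbb{N}^*},\ \alpha_k \in \mathbb{R},\ \omega_k \in \mathbb{R}^d \ \text{s.t.}\ \forall \omega \in \mathbb{R}^d,\ F(\omega) = \sum_{k=1}^{+\infty} \alpha_k \frac{1}{\prod_{j=1}^{d} \gamma[j]} H\big((\omega + \omega_k) \oslash \gamma\big),$$

where the convergence is to be understood in the $L^1$
sense. As $F$ is even, we also have

$$F(\omega) = F(-\omega) = \sum_{k=1}^{+\infty} \alpha_k \frac{1}{\prod_{j=1}^{d} \gamma[j]} H\big((-\omega + \omega_k) \oslash \gamma\big),$$

so that

$$F(\omega) = \sum_{k=1}^{+\infty} \alpha_k \frac{1}{2\prod_{j=1}^{d} \gamma[j]} \Big[ H\big((\omega_k - \omega) \oslash \gamma\big) + H\big((\omega_k + \omega) \oslash \gamma\big) \Big].$$

This proves that the family of spectral density functions

$$\sum_{k=1}^{K} \alpha_k \frac{1}{2\prod_{j=1}^{d} \gamma[j]} \Big[ H\big((\omega_k - \omega) \oslash \gamma\big) + H\big((\omega_k + \omega) \oslash \gamma\big) \Big]$$

is dense in the family of Fourier transforms of integrable
stationary kernels$^{10}$ with respect to the convergence
in $L^1$ of functions. Hence, as proved in the paper,
the corresponding family of inverse Fourier transforms,
namely

$$\sum_{k=1}^{K} \alpha_k\, h_\gamma(\tau) \cos(2\pi \omega_k^T \tau),$$

is dense in the family of integrable real-valued
stationary kernels with respect to the pointwise
convergence of functions. This result also holds for
the superset $\sum_{k=1}^{K} \alpha_k\, h_{\gamma_k}(\tau) \cos(2\pi \omega_k^T \tau)$ by definition
of the density property. Moreover, as the cosine
function is even, the density property is preserved
after imposing the constraint $\omega_k \in \mathbb{R}_+^d$.

$^{9}$ $\int_{\mathbb{R}^d} F(\omega)\,d\omega = k(0) < +\infty$.

$^{10}$ We recall that Bochner's theorem implies that the
spectral density function of an integrable kernel is its
Fourier transform.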

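Case 2: $\mu_{\text{cont.}} = 0$

We now deal with the case where the continuous part
of the spectral measure of $k$ in Lebesgue's decomposition
theorem is null. A refined version of Lebesgue's
decomposition theorem states that the singular measure
can be uniquely decomposed as $\mu_{\text{sing.}} = \mu_{\text{pp.}} + \mu_{\text{sc.}}$
where $\mu_{\text{pp.}}$ is a discrete (pure-point) measure and $\mu_{\text{sc.}}$
is mutually singular with Lebesgue's measure. The
singular continuous measure $\mu_{\text{sc.}}$ is not intuitive as
it gives null probability mass to any countable set of
'outcomes', and yet it gives positive probability mass
to some sets of outcomes with null 'volume' (Lebesgue's
measure). For those reasons, we believe singular
continuous measures to be of limited interest in most
statistical inference problems involving stationary kernels,
and we will restrict our attention to discrete measures
in this section (i.e. $\mu_{\text{sc.}} = 0$).

We recall from Eq. (3) that the stationary covariance
functions arising from discrete positive and symmetric
spectral measures can be written as:

$$k_{\text{sing.}}(\tau) = \sum_{k=1}^{+\infty} \alpha_k \cos(2\pi \omega_k^T \tau),$$

with $\alpha_k > 0$ and $\sum_{k=1}^{+\infty} \alpha_k < +\infty$. Moreover, as $h$
is positive-valued and positive semi-definite, we have
that

$$\det \begin{pmatrix} h(0) & h(\tau \odot \gamma) \\ h(\tau \odot \gamma) & h(0) \end{pmatrix} \geq 0,$$

which implies

$$\forall \gamma > 0,\ 0 < h(\tau \odot \gamma) \leq h(0).$$

Hence,

$$\forall \gamma,\ \left| \alpha_k \frac{h(\tau \odot \gamma)}{h(0)} \cos(2\pi \omega_k^T \tau) \right| \leq \left| \alpha_k \cos(2\pi \omega_k^T \tau) \right|.$$

As $\sum_{k=1}^{+\infty} |\alpha_k \cos(2\pi \omega_k^T \tau)| \leq \sum_{k=1}^{+\infty} \alpha_k < +\infty$, by the
dominated convergence theorem we have that

$$\lim_{\gamma \to 0} \sum_{k=1}^{+\infty} \alpha_k \frac{h(\tau \odot \gamma)}{h(0)} \cos(2\pi \omega_k^T \tau) = \sum_{k=1}^{+\infty} \alpha_k \lim_{\gamma \to 0} \frac{h(\tau \odot \gamma)}{h(0)} \cos(2\pi \omega_k^T \tau) = k_{\text{sing.}}(\tau).$$

Hence, any stationary kernel whose spectral measure
is a pure-point measure is the limit of kernels of the
form $\sum_{k=1}^{K} \alpha_k\, h(\tau \odot \gamma_k) \cos(2\pi \omega_k^T \tau)$ (as $\gamma_k$ go to 0 and $K$
goes to $+\infty$), which concludes the proof in the second
case.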
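The limiting argument of the second case can be illustrated numerically. The sketch below is a one-dimensional illustration under assumed values: a finite pure-point spectrum with hypothetical weights `alphas` and frequencies `omegas`, and $h(\tau) = e^{-|\tau|}$ as the base function. It evaluates the largest deviation between $\frac{h(\gamma \tau)}{h(0)} \sum_k \alpha_k \cos(2\pi \omega_k \tau)$ and $k_{\text{sing.}}(\tau)$ on a bounded grid as $\gamma$ decreases.

```python
import numpy as np

# Hypothetical pure-point spectrum: positive weights summing to 1.
alphas = np.array([0.5, 0.3, 0.2])
omegas = np.array([0.5, 1.0, 2.5])
tau = np.linspace(-3.0, 3.0, 601)

def h(t):
    # Assumed base function (Matern 1/2 form): positive, PSD, integrable.
    return np.exp(-np.abs(t))

# Target kernel k_sing(tau) = sum_k alpha_k cos(2 pi omega_k tau).
k_sing = (alphas[:, None] * np.cos(2 * np.pi * omegas[:, None] * tau[None, :])).sum(axis=0)

gammas = [1.0, 0.1, 0.01]
errors = []
for gamma in gammas:
    # Generalized spectral kernel with a common scale gamma, normalized by h(0).
    approx = h(gamma * tau) / h(0.0) * k_sing
    errors.append(float(np.max(np.abs(approx - k_sing))))

print(errors)
```

On a fixed bounded grid the maximum error shrinks as $\gamma$ goes to 0, in line with the pointwise convergence used in the proof; the convergence is not claimed to be uniform over all of $\mathbb{R}$.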

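Case 3: $\mu_{\text{sing.}} \neq 0$ and $\mu_{\text{cont.}} \neq 0$

In the general case, we decompose any covariance function
$k$ as

$$k = k_{\text{sing.}} + k_{\text{cont.}},$$

with $k_{\text{sing.}}(\tau) = \int_{\mathbb{R}^d} e^{2\pi i \omega^T \tau}\, d\mu_{\text{sing.}}(\omega)$ and
$k_{\text{cont.}}(\tau) = \int_{\mathbb{R}^d} e^{2\pi i \omega^T \tau}\, d\mu_{\text{cont.}}(\omega)$. We then use
the two cases previously discussed to conclude that
$k$ is the limit of linear combinations of kernels of the
form $k_g(\tau; \gamma, \omega)$. ■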


Appendix B Proof of Prop. 5 

We now prove the following proposition. 

Proposition: A mean zero stationary Gaussian pro¬ 
cess with stationary generalized spectral covariance 
function is p times continuously differentiable in the 
mean square sense if and only if a mean zero station¬ 
ary Gaussian process with covariance function h is. 

Proof. $p$ times differentiability of a stationary
GP in the mean square sense is equivalent to $2p$
times differentiability of its covariance function at
0 (Adler and Taylor (2011)). It is easy to see that
if $h$ is $2p$ times differentiable at 0, then so is the
corresponding stationary generalized spectral kernel.
Conversely, a simple reasoning by contradiction
allows us to conclude that if $h$ is not at least $2p$
times differentiable at 0, then $h_{\gamma_k}(\tau) \cos(2\pi \omega_k^T \tau)$, and
subsequently the corresponding stationary spectral
kernel, cannot be. ■
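The case $p = 1$ of the proposition can be illustrated with second-order difference quotients at 0. The sketch below is a one-dimensional illustration with assumed choices: $h(\tau) = e^{-\tau^2}$ (twice differentiable at 0) and $h(\tau) = e^{-|\tau|}$ (not differentiable at 0), each modulated by $\cos(2\pi \omega \tau)$ with a hypothetical frequency `omega`. The quotient converges for the first kernel and diverges for the second.

```python
import numpy as np

omega = 1.0  # hypothetical modulation frequency

def second_difference(k, delta):
    """Central second-order difference quotient of k at 0."""
    return (k(delta) - 2.0 * k(0.0) + k(-delta)) / delta**2

def k_se(t):
    # Smooth base function h(t) = exp(-t^2), modulated by a cosine.
    return np.exp(-t**2) * np.cos(2 * np.pi * omega * t)

def k_ma12(t):
    # Matern 1/2 base function h(t) = exp(-|t|), modulated by a cosine.
    return np.exp(-np.abs(t)) * np.cos(2 * np.pi * omega * t)

deltas = [1e-2, 1e-3, 1e-4]
q_se = [float(second_difference(k_se, d)) for d in deltas]
q_ma12 = [float(second_difference(k_ma12, d)) for d in deltas]

# Exact second derivative at 0 of exp(-t^2) cos(2 pi omega t).
expected_se = -2.0 - (2 * np.pi * omega) ** 2

print(q_se, expected_se)
print(q_ma12)
```

The quotients for the smooth kernel settle near `expected_se`, while those for the Matérn 1/2 case grow roughly like $-2/\delta$, so the cosine modulation neither adds nor removes differentiability at 0 in this example.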


Appendix C Proof of Th. 7 

In this section we prove Th. 7, which we recall below. 

Theorem Let $(x, y) \to k^*(x, y)$ be a real-valued positive
semi-definite, continuous, and integrable function
such that $\forall x, y,\ k^*(x, y) > 0$. The family

$$k_K(x, y) := \sum_{k=1}^{K} \alpha_k\, k^*(x \odot \gamma_k, y \odot \gamma_k)\, \Psi_k(x)^T \Psi_k(y)$$

where

$$\Psi_k(x) = \begin{pmatrix} \cos(2\pi x^T \omega_k^1) + \cos(2\pi x^T \omega_k^2) \\ \sin(2\pi x^T \omega_k^1) + \sin(2\pi x^T \omega_k^2) \end{pmatrix},$$

with $\gamma_k \in \mathbb{R}_+^d$, $\omega_k^1, \omega_k^2 \in \mathbb{R}^d$, $\alpha_k \in \mathbb{R}$, $K \in \mathbb{N}^*$ is dense
in the family of real-valued continuous bounded nonstationary
kernels with respect to the pointwise convergence
of functions.
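To make the form of the family concrete, the following is a minimal sketch that evaluates $k_K$ on random inputs and checks that the resulting Gram matrix is symmetric positive semi-definite. Several choices are assumptions made for illustration only: the base kernel $k^*(x, y) = e^{-(\|x\|^2 + \|y\|^2)}\, e^{-\|x - y\|^2}$ (positive, continuous, integrable, positive semi-definite, with $k^*(0, 0) = 1$), nonnegative weights $\alpha_k$ (the theorem allows real weights, for which positive semi-definiteness is not automatic), and the function names `k_star`, `psi` and `gs_kernel`.

```python
import numpy as np

def k_star(x, y):
    # Assumed base kernel: rank-one Gaussian envelope times a squared-exponential kernel.
    sq_x = np.sum(x**2, axis=1)[:, None]
    sq_y = np.sum(y**2, axis=1)[None, :]
    sq_dist = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=2)
    return np.exp(-(sq_x + sq_y)) * np.exp(-sq_dist)

def psi(x, w1, w2):
    # Feature map Psi_k(x): sums of cosines and sines at the two frequencies.
    a = 2 * np.pi * (x @ w1)
    b = 2 * np.pi * (x @ w2)
    return np.stack([np.cos(a) + np.cos(b), np.sin(a) + np.sin(b)], axis=1)

def gs_kernel(x, y, alphas, gammas, w1s, w2s):
    # Sum over k of alpha_k * k*(x . gamma_k, y . gamma_k) * Psi_k(x)^T Psi_k(y).
    out = np.zeros((x.shape[0], y.shape[0]))
    for alpha, gamma, w1, w2 in zip(alphas, gammas, w1s, w2s):
        out += alpha * k_star(x * gamma, y * gamma) * (psi(x, w1, w2) @ psi(y, w1, w2).T)
    return out

rng = np.random.default_rng(0)
n, d, K = 30, 2, 5
X = rng.normal(size=(n, d))
alphas = rng.uniform(0.1, 1.0, size=K)      # nonnegative weights (assumption)
gammas = rng.uniform(0.1, 1.0, size=(K, d))  # per-component input scales
w1s = rng.normal(size=(K, d))
w2s = rng.normal(size=(K, d))

gram = gs_kernel(X, X, alphas, gammas, w1s, w2s)
min_eig = float(np.min(np.linalg.eigvalsh((gram + gram.T) / 2)))
print(min_eig)
```

Each summand is an elementwise product of two positive semi-definite Gram matrices, so with nonnegative weights the sum is positive semi-definite up to rounding error.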


Proof. $k^*$ being integrable, it admits a Fourier transform

$$K^*(\omega_1, \omega_2) := \mathcal{F}(k^*)(\omega_1, \omega_2),$$

and we have

$$\forall x, y,\ \mathcal{F}(K^*)(x, y) = k^*(-x, -y) > 0.$$

Hence, the conditions of Wiener's Tauberian theorem
are met, so that any integrable function on $\mathbb{R}^d \times \mathbb{R}^d$ is
the limit of linear combinations of translations of $K^*$.

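Let $k$ be a real-valued continuous bounded nonstationary
kernel and $\mu_F$ its Lebesgue-Stieltjes spectral measure.
We will start with the case where $\mu_F$ is absolutely
continuous with respect to Lebesgue's measure.
In that case, denoting $f$ the corresponding Radon-Nikodym
derivative, we have:

$$k(x, y) = \int_{\mathbb{R}^d \times \mathbb{R}^d} e^{2\pi i (x^T \omega_1 - y^T \omega_2)} f(\omega_1, \omega_2)\, d\omega_1 d\omega_2.$$

Noting that $f$ is integrable, we can define

$$f_K(\omega_1, \omega_2) := \sum_{k=1}^{K} \beta_k K^*(\omega_1 + \omega_k^1, \omega_2 + \omega_k^2),$$

a sequence of linear combinations of translations of $K^*$
converging to $f$ in the $L^1$ sense. We can always consider
such a sequence $\{f_K\}$ with symmetric functions.
In effect, for any candidate $\{f_K\}$, $\{\tilde{f}_K\}$ with

$$\tilde{f}_K(\omega_1, \omega_2) = \frac{1}{2} \big( f_K(\omega_1, \omega_2) + f_K(\omega_2, \omega_1) \big),$$

are symmetric, integrable, linear combinations of
translations of $K^*$ as $K^*$ is symmetric, and converge to
$f$ in the $L^1$ sense. As both $f_K$ and $K^*$ are symmetric,
we also have:

$$f_K(\omega_1, \omega_2) = f_K(\omega_2, \omega_1) = \sum_{k=1}^{K} \beta_k K^*(\omega_2 + \omega_k^1, \omega_1 + \omega_k^2) = \sum_{k=1}^{K} \beta_k K^*(\omega_1 + \omega_k^2, \omega_2 + \omega_k^1).$$

We can therefore rewrite $f_K$ as

$$f_K(\omega_1, \omega_2) := \sum_{k=1}^{K} \frac{\beta_k}{2} \Big[ K^*(\omega_1 + \omega_k^1, \omega_2 + \omega_k^2) + K^*(\omega_1 + \omega_k^2, \omega_2 + \omega_k^1)$$
$$\qquad + K^*(\omega_1 + \omega_k^1, \omega_2 + \omega_k^1) + K^*(\omega_1 + \omega_k^2, \omega_2 + \omega_k^2)$$
$$\qquad - K^*(\omega_1 + \omega_k^1, \omega_2 + \omega_k^1) - K^*(\omega_1 + \omega_k^2, \omega_2 + \omega_k^2) \Big].$$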

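Denoting

$$\tilde{k}(x, y) := \int_{\mathbb{R}^d \times \mathbb{R}^d} e^{2\pi i (x^T \omega_1 - y^T \omega_2)} f_K(\omega_1, \omega_2)\, d\omega_1 d\omega_2,$$

it is easy to see, by applying Jensen's inequality to
$|\tilde{k}(x, y) - k(x, y)|$ like we did in the stationary case, that
$\tilde{k}$ converges to $k$ pointwise. Thus, as $k$ is real-valued,
the real part of $\tilde{k}$ converges to $k$ pointwise too. Simple
changes of variables using the expanded expression of
$f_K(\omega_1, \omega_2)$ give us the following expression for the real
part of $\tilde{k}$:

$$\text{Re}\big(\tilde{k}(x, y)\big) = \sum_{k=1}^{K} \frac{\beta_k}{2} k^*(x, y) \Big\{ \cos\big(2\pi (x^T \omega_k^1 - y^T \omega_k^2)\big) + \cos\big(2\pi (x^T \omega_k^2 - y^T \omega_k^1)\big)$$
$$\qquad + \cos\big(2\pi (x - y)^T \omega_k^1\big) + \cos\big(2\pi (x - y)^T \omega_k^2\big) \Big\}$$
$$\qquad + \sum_{k=1}^{K} -\frac{\beta_k}{2} k^*(x, y) \cos\big(2\pi (x - y)^T \omega_k^1\big) + \sum_{k=1}^{K} -\frac{\beta_k}{2} k^*(x, y) \cos\big(2\pi (x - y)^T \omega_k^2\big). \qquad (12)$$

If we denote

$$\Psi_{a,b}(x) = \begin{pmatrix} \cos(2\pi x^T a) + \cos(2\pi x^T b) \\ \sin(2\pi x^T a) + \sin(2\pi x^T b) \end{pmatrix},$$

expanding the cosine functions in Eq. (12), it follows that each
of the three sums above can be rewritten in the form

$$\sum_{k=1}^{K} \alpha_k\, k^*(x, y)\, \Psi_k(x)^T \Psi_k(y),$$

where for the first sum we have $\alpha_k = \frac{\beta_k}{2}$, $\Psi_k(x) = \Psi_{\omega_k^1, \omega_k^2}(x)$,
for the other two sums $\alpha_k = -\frac{\beta_k}{8}$, for
the second sum $\Psi_k(x) = \Psi_{\omega_k^1, \omega_k^1}(x)$ and for the last
sum $\Psi_k(x) = \Psi_{\omega_k^2, \omega_k^2}(x)$. This proves that for any
real-valued continuous bounded nonstationary kernel
$k$ with absolutely continuous spectral measure, there
exists a sequence of the form

$$k_K(x, y) = \sum_{k=1}^{K} \alpha_k\, k^*(x, y)\, \Psi_k(x)^T \Psi_k(y)$$

that converges to $k$.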

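As for pure-point spectral measures, they can be written as

$$\mu_F(d\lambda_1, d\lambda_2) = \sum_{k=1}^{+\infty} \beta_k\, \delta_{(\omega_k^1, \omega_k^2)}(d\lambda_1, d\lambda_2) \qquad (13)$$

with $\sum_{k=1}^{+\infty} |\beta_k| < +\infty$. By symmetry of $\mu_F$, Eq. (13)
may be rewritten as

$$\mu_F(d\lambda_1, d\lambda_2) = \sum_{k=1}^{+\infty} \frac{\beta_k}{2} \Big( \delta_{(\omega_k^1, \omega_k^2)}(d\lambda_1, d\lambda_2) + \delta_{(\omega_k^2, \omega_k^1)}(d\lambda_1, d\lambda_2) \Big),$$

and using the same trick as in the integrable case, we
get that the corresponding kernels are of the form

$$k(x, y) = \sum_{k=1}^{+\infty} \alpha_k\, \Psi_k(x)^T \Psi_k(y),$$

where we also have $\sum_{k=1}^{+\infty} |\alpha_k| < +\infty$. As both
$(x, y) \to \Psi_k(x)^T \Psi_k(y)$ and $(x, y) \to k^*(x \odot \gamma, y \odot \gamma)$
are bounded, we may once again use the dominated
convergence theorem to conclude that

$$k(x, y) = \sum_{k=1}^{+\infty} \lim_{\gamma \to 0} \alpha_k\, k^*(x \odot \gamma, y \odot \gamma)\, \Psi_k(x)^T \Psi_k(y)$$
$$\qquad = \lim_{\gamma \to 0} \sum_{k=1}^{+\infty} \alpha_k\, k^*(x \odot \gamma, y \odot \gamma)\, \Psi_k(x)^T \Psi_k(y)$$
$$\qquad = \lim_{\gamma \to 0,\ K \to +\infty} \sum_{k=1}^{K} \alpha_k\, k^*(x \odot \gamma, y \odot \gamma)\, \Psi_k(x)^T \Psi_k(y),$$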


where we have used the continuity of $k^*$ at $(0, 0)$
and $k^*(0, 0) = 1$. This concludes the proof for the
pure-point case. Hybrid cases are dealt with in a
similar manner to the proof of Th. 3. ■


Appendix D Technical discussion on Th. 6

Every real-valued continuous bounded function of the
form of Eq. (8) is indeed the covariance function of
a real-valued mean square continuous stochastic process.
Strictly speaking, for a real-valued continuous
bounded function that is also the covariance function
of a real-valued mean square continuous stochastic
process to be of the form of Eq. (8), it has to
be harmonizable (Yaglom (1987); Kakihara (1985)).
However, as noted by Yaglom (1987, p. 464), the
only continuous bounded covariance functions known
to the author that are not harmonizable 'are rather
complicated and have some unusual, even pathological
properties'. Moreover, Kakihara (1985) proved that,
if a real-valued bounded function $k$ is the covariance
function of a mean square continuous stochastic process
$(f(x))$, provided that the mapping $x \to f(x)$ is
(strongly) measurable, $k$ is harmonizable. The foregoing
condition will often be verified by stochastic processes
of practical interest, which is the reason why, as
did Genton (2002), we have ignored this technicality
in Th. 6.

Footnote: $k^*$ being continuous and integrable is bounded.




